Marks: 40
The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.
The data contains the different data related to a food order. The detailed data dictionary is given below.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.750 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.080 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.230 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.200 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.590 | Weekday | 4 | 25 | 24 |
The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.
# Write your code here
df.shape
(1898, 9)
There are 1898 rows and 9 columns in the dataset.
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1898 entries, 0 to 1897 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 1898 non-null int64 1 customer_id 1898 non-null int64 2 restaurant_name 1898 non-null object 3 cuisine_type 1898 non-null object 4 cost_of_the_order 1898 non-null float64 5 day_of_the_week 1898 non-null object 6 rating 1898 non-null object 7 food_preparation_time 1898 non-null int64 8 delivery_time 1898 non-null int64 dtypes: float64(1), int64(4), object(4) memory usage: 133.6+ KB
#lets see if there any null values in the data set
df.isnull().sum()
order_id 0 customer_id 0 restaurant_name 0 cuisine_type 0 cost_of_the_order 0 day_of_the_week 0 rating 0 food_preparation_time 0 delivery_time 0 dtype: int64
#Now lets check why the rating column is object type
df['rating'].unique()
array(['Not given', '5', '3', '4'], dtype=object)
#seems like there are customers who havent given any rating on their order, lets count their total number & their percentage
df.loc[df['rating']=='Not given','rating'].value_counts(), df.loc[df['rating']=='Not given','rating'].value_counts()*100/df.shape[0]
(Not given 736 Name: rating, dtype: int64, Not given 38.777661 Name: rating, dtype: float64)
#lets replace the data with nan values
df['rating']=df['rating'].replace('Not given',np.nan)
#lets conver the data to float type
df['rating']=df['rating'].astype(float)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1898 entries, 0 to 1897 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 1898 non-null int64 1 customer_id 1898 non-null int64 2 restaurant_name 1898 non-null object 3 cuisine_type 1898 non-null object 4 cost_of_the_order 1898 non-null float64 5 day_of_the_week 1898 non-null object 6 rating 1162 non-null float64 7 food_preparation_time 1898 non-null int64 8 delivery_time 1898 non-null int64 dtypes: float64(2), int64(4), object(3) memory usage: 133.6+ KB
# Write your code here
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| order_id | 1898.000 | 1477495.500 | 548.050 | 1476547.000 | 1477021.250 | 1477495.500 | 1477969.750 | 1478444.000 |
| customer_id | 1898.000 | 171168.478 | 113698.140 | 1311.000 | 77787.750 | 128600.000 | 270525.000 | 405334.000 |
| cost_of_the_order | 1898.000 | 16.499 | 7.484 | 4.470 | 12.080 | 14.140 | 22.297 | 35.410 |
| rating | 1162.000 | 4.344 | 0.741 | 3.000 | 4.000 | 5.000 | 5.000 | 5.000 |
| food_preparation_time | 1898.000 | 27.372 | 4.632 | 20.000 | 23.000 | 27.000 | 31.000 | 35.000 |
| delivery_time | 1898.000 | 24.162 | 4.973 | 15.000 | 20.000 | 25.000 | 28.000 | 33.000 |
# Write the code here
df['rating'].isnull().sum()
736
df['rating'].isnull().sum()*100/df.shape[0]
38.77766069546892
*There are 736 orders which are not rated, which account for almost 39% of the orders. We have already converted them to NaN values in the previous segment.
Analysis of Costs of the Orders
# Lets check first the distribution of cost_of_order
sns.histplot(data=df, x='cost_of_the_order',kde='True')
plt.show()
sns.boxplot(data=df, x='cost_of_the_order')
<Axes: xlabel='cost_of_the_order'>
#lets create price bins for prices between 5-15,15-25 & 25-35 to see if there is any pattern
df['price_bin']=pd.cut(df['cost_of_the_order'],bins=[5,15,25,35],labels=["Low","Medium","High"])
sns.histplot(data=df, x='price_bin')
<Axes: xlabel='price_bin', ylabel='Count'>
Observations on Ratings
sns.countplot(data=df, x='rating')
<Axes: xlabel='rating', ylabel='count'>
Observations on Food Preparation and Delivery Time
plt.figure(figsize = (5, 3))
sns.countplot(data=df, x='food_preparation_time');
plt.show()
plt.figure(figsize = (4, 2))
sns.boxplot(data=df,x='food_preparation_time')
plt.show()
plt.figure(figsize = (5, 3))
sns.countplot(data=df, x='delivery_time');
plt.show()
plt.figure(figsize = (4, 2))
sns.boxplot(data=df,x='delivery_time')
plt.show()
df.loc[df['delivery_time']>30, 'delivery_time'].value_counts().sum()
149
Observations
df['restaurant_name'].nunique(),df['cuisine_type'].nunique()
(178, 14)
sns.countplot(data=df, x='cuisine_type')
plt.xticks(rotation=90)
plt.show()
sns.countplot(data=df, x='price_bin', hue='day_of_the_week')
<Axes: xlabel='price_bin', ylabel='count'>
df.loc[df['day_of_the_week']=='Weekend','day_of_the_week'].value_counts()/df.shape[0]
Weekend 0.712 Name: day_of_the_week, dtype: float64
Observations on Week of Day & Cuisine Type
df['restaurant_name'].value_counts().head()
Shake Shack 219 The Meatball Shop 132 Blue Ribbon Sushi 119 Blue Ribbon Fried Chicken 96 Parm 68 Name: restaurant_name, dtype: int64
df['restaurant_name'].value_counts().shape[0]/df.shape[0]
0.09378292939936776
Observations:
df.loc[df['day_of_the_week']=='Weekend','cuisine_type'].value_counts().head()
American 415 Japanese 335 Italian 207 Chinese 163 Mexican 53 Name: cuisine_type, dtype: int64
df.loc[df['day_of_the_week']=='Weekday','cuisine_type'].value_counts().head()
American 169 Japanese 135 Italian 91 Chinese 52 Mexican 24 Name: cuisine_type, dtype: int64
df[df['cost_of_the_order']>20].shape[0]/df.shape[0]
0.2924130663856691
df[df['cost_of_the_order']>20].shape[0]
555
df['delivery_time'].mean()
24.161749209694417
df['delivery_time'].mean()+df['food_preparation_time'].mean()
51.53371970495259
df['customer_id'].value_counts().nlargest(10)
52832 13 47440 10 83287 9 250494 8 259341 7 82041 7 65009 7 276192 7 97079 6 97991 6 Name: customer_id, dtype: int64
df.columns
Index(['order_id', 'customer_id', 'restaurant_name', 'cuisine_type',
'cost_of_the_order', 'day_of_the_week', 'rating',
'food_preparation_time', 'delivery_time'],
dtype='object')
plt.figure(figsize=(8,4))
sns.heatmap(df[['cost_of_the_order','rating','delivery_time','food_preparation_time']].corr(),annot=True,vmin=-1,vmax=1);
plt.show()
Lets first check relationship of cost of orders to the categorical variables.
#lets observe the distribution of cost for each cuisine type
sns.boxplot(data=df,x='cuisine_type',y='cost_of_the_order')
plt.xticks(rotation=90)
plt.show()
#now lets check the cost of order on day of the week for each cuisine type
plt.figure(figsize=(3,1))
sns.relplot(data=df,x='day_of_the_week',y='cost_of_the_order',col='cuisine_type',kind='scatter',col_wrap=4)
plt.show()
<Figure size 300x100 with 0 Axes>
We can see that there is difference in number of orders in weekday and weekend for few of the cuisine types, lets explore it further
df.groupby(['cuisine_type', 'day_of_the_week']).agg({'cost_of_the_order': 'mean', 'order_id': 'nunique', 'rating': 'mean'})
| cost_of_the_order | order_id | rating | ||
|---|---|---|---|---|
| cuisine_type | day_of_the_week | |||
| American | Weekday | 15.308 | 169 | 4.252 |
| Weekend | 16.732 | 415 | 4.319 | |
| Chinese | Weekday | 17.148 | 52 | 4.231 |
| Weekend | 16.036 | 163 | 4.364 | |
| French | Weekday | 20.866 | 5 | 4.500 |
| Weekend | 19.382 | 13 | 4.250 | |
| Indian | Weekday | 17.162 | 24 | 4.429 |
| Weekend | 16.801 | 49 | 4.583 | |
| Italian | Weekday | 16.890 | 91 | 4.403 |
| Weekend | 16.211 | 207 | 4.336 | |
| Japanese | Weekday | 16.368 | 135 | 4.365 |
| Weekend | 16.279 | 335 | 4.377 | |
| Korean | Weekday | 12.180 | 2 | 4.500 |
| Weekend | 14.333 | 11 | 4.000 | |
| Mediterranean | Weekday | 14.040 | 14 | 4.000 |
| Weekend | 16.102 | 32 | 4.333 | |
| Mexican | Weekday | 16.556 | 24 | 4.375 |
| Weekend | 17.104 | 53 | 4.438 | |
| Middle Eastern | Weekday | 19.833 | 17 | 4.467 |
| Weekend | 18.283 | 32 | 4.053 | |
| Southern | Weekday | 19.227 | 6 | 4.000 |
| Weekend | 19.341 | 11 | 4.444 | |
| Spanish | Weekday | 12.130 | 1 | NaN |
| Weekend | 19.618 | 11 | 4.833 | |
| Thai | Weekday | 15.463 | 4 | 4.000 |
| Weekend | 20.207 | 15 | 4.750 | |
| Vietnamese | Weekday | 11.917 | 3 | 3.000 |
| Weekend | 13.608 | 4 | 4.667 |
Observations:
df['rating'].mean()
4.344234079173838
plt.figure(figsize=(5,5))
sns.boxplot(data=df,x='cuisine_type',y='rating')
plt.xticks(rotation=90)
plt.show()
#Lests compare mean prices for cuisine types and mean ratings
df_one=df.groupby(['cuisine_type']).agg({'rating': 'mean', 'cost_of_the_order': 'mean'})
sns.lmplot(data=df_one,x='cost_of_the_order',y='rating',ci=False)
<seaborn.axisgrid.FacetGrid at 0x7a5318434a30>
df_one.nlargest(10,columns='rating') , df_one.nlargest(10,columns='cost_of_the_order')
( rating cost_of_the_order
cuisine_type
Spanish 4.833 18.994
Thai 4.667 19.208
Indian 4.540 16.920
Mexican 4.417 16.933
Japanese 4.374 16.305
Italian 4.360 16.419
Chinese 4.338 16.305
Southern 4.308 19.301
French 4.300 19.794
American 4.299 16.320,
rating cost_of_the_order
cuisine_type
French 4.300 19.794
Southern 4.308 19.301
Thai 4.667 19.208
Spanish 4.833 18.994
Middle Eastern 4.235 18.821
Mexican 4.417 16.933
Indian 4.540 16.920
Italian 4.360 16.419
American 4.299 16.320
Chinese 4.338 16.305)
#lets see if there is any relationship between mean cost of the order and mean rating for the cuisine types
df_one[['cost_of_the_order','rating']].corr()
| cost_of_the_order | rating | |
|---|---|---|
| cost_of_the_order | 1.000 | 0.626 |
| rating | 0.626 | 1.000 |
#lets check the unrated orders for each cuisine type
df.groupby('cuisine_type',as_index=False)['rating'].apply(lambda x: x.isnull().sum())
| cuisine_type | rating | |
|---|---|---|
| 0 | American | 216 |
| 1 | Chinese | 82 |
| 2 | French | 8 |
| 3 | Indian | 23 |
| 4 | Italian | 126 |
| 5 | Japanese | 197 |
| 6 | Korean | 4 |
| 7 | Mediterranean | 14 |
| 8 | Mexican | 29 |
| 9 | Middle Eastern | 15 |
| 10 | Southern | 4 |
| 11 | Spanish | 6 |
| 12 | Thai | 10 |
| 13 | Vietnamese | 2 |
#lets see the distribution of ratings on day of the week
sns.countplot(data=df,x='rating',hue='day_of_the_week')
plt.show()
#lets check the unrated orders for day fo the week
df.groupby('day_of_the_week',as_index=False)['rating'].apply(lambda x: x.isnull().sum())
| day_of_the_week | rating | |
|---|---|---|
| 0 | Weekday | 207 |
| 1 | Weekend | 529 |
df.groupby(['day_of_the_week'])['order_id'].count()
day_of_the_week Weekday 547 Weekend 1351 Name: order_id, dtype: int64
#lets check the unrated orders for each cuisine type
df.groupby(['day_of_the_week'])['rating'].mean()
day_of_the_week Weekday 4.309 Weekend 4.359 Name: rating, dtype: float64
df.columns
Index(['order_id', 'customer_id', 'restaurant_name', 'cuisine_type',
'cost_of_the_order', 'day_of_the_week', 'rating',
'food_preparation_time', 'delivery_time', 'price_bin'],
dtype='object')
df.groupby(['price_bin']).agg({'rating':'mean','order_id':'nunique'})
| rating | order_id | |
|---|---|---|
| price_bin | ||
| Low | 4.312 | 999 |
| Medium | 4.402 | 585 |
| High | 4.338 | 304 |
promo=pd.Series(df.groupby('restaurant_name').filter(lambda group: group['rating'].mean() > 4 and group['rating'].count() > 50)['restaurant_name'].unique())
promo
0 Blue Ribbon Fried Chicken 1 The Meatball Shop 2 Shake Shack 3 Blue Ribbon Sushi dtype: object
# orders above 20 dollars revenue
rev_a=np.round(df.loc[df['cost_of_the_order']>20,'cost_of_the_order'].sum()*.25,2)
orders_a=df.loc[df['cost_of_the_order']>20,'order_id'].nunique()
rev_a,orders_a
(3688.73, 555)
#orders above 5 dollars
rev_b=np.round(df.loc[(df['cost_of_the_order']<=20) & (df['cost_of_the_order']>5),'cost_of_the_order'].sum()*.15,2)
orders_b=df.loc[(df['cost_of_the_order']<=20) & (df['cost_of_the_order']>5),'order_id'].nunique()
rev_b, orders_b
(2477.58, 1334)
#total revenue
total_rev=round(rev_a+rev_b,2)
total_rev
6166.31
# we can create a column named 'total_time'
df['total_time']=df['delivery_time']+df['food_preparation_time']
df.head(2)
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | price_bin | total_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.750 | Weekend | NaN | 25 | 20 | High | 45 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.080 | Weekend | NaN | 25 | 23 | Low | 48 |
#lest calculate how many orders take more than 60 mins
df.loc[df['total_time']>60,'total_time'].shape[0] , df.loc[df['total_time']>60,'total_time'].shape[0]/df.shape[0]
(200, 0.1053740779768177)
sns.lineplot(data=df, x='total_time', y='rating', ci=False)
<ipython-input-109-7ea2c51fb74d>:1: FutureWarning:
The `ci` parameter is deprecated. Use `errorbar=('ci', False)` for the same effect.
sns.lineplot(data=df, x='total_time', y='rating', ci=False)
<Axes: xlabel='total_time', ylabel='rating'>
# Write the code here
df.groupby(['day_of_the_week'])['delivery_time'].mean()
day_of_the_week Weekday 28.340 Weekend 22.470 Name: delivery_time, dtype: float64
import plotly.express as px
fig = px.box(df, x="day_of_the_week", y="delivery_time")
fig.show()
fig = px.bar(df, x="delivery_time", y="rating")
fig.show()
The dataset contained 1898 orders for FoodHub which offers access to multiple restaurants in New York for food delivery service. On the basis of analysis done on the dataset, following conclusions can be made: